Amendments to the Claims:

This listing of the claims will replace all prior versions and listings of the claims in 
the application: 

Listing of Claims:

1. (Currently Amended) A noise reduction system [[with]] including an audio-visual user
interface therein[[, said system being specially adapted]] for running an application
for combining visual features (o_v,nT) extracted from a digital video sequence (v(nT))
[[showing the face of a speaker (S_i)]] with audio features extracted from an analog audio
sequence [[(s(t)), wherein said audio sequence (s(t)) can include]] including background
noise in [[the]] an environment of [[said]] a speaker (S_i), said noise reduction system
(200b/c) comprising:

[[-]] audio sequence detection means (101a, 106b) for detecting said analog audio
sequence; and

audio feature extraction and analysis means for analyzing said analog audio sequence
[[(s(t))]] and extracting said audio features therefrom;

[[-]] video sequence detection means (101b') for detecting said video sequence
(v(nT))[[, and]];

[[-]] visual feature extraction and analysis means (104a+b, 104'+104'') for analyzing
the detected video sequence signal (v(nT)), and extracting said visual features therefrom;

[[wherein]] a noise reduction circuit (106) [[of said noise reduction system is adapted]]
configured to separate [[the]] a speaker's voice from said background noise (n'(t)) based on a
combination of derived speech characteristics (o_av,nT := [o_a,nT^T, o_v,nT^T]^T) and
[[outputting]] configured to output a speech activity indication signal (s_i(nT)) [[which is
obtained by]] comprising a combination of speech activity estimates supplied by said audio
feature extraction and analysis means and said visual feature extraction and analysis
[[analyzing]] means [[(106b, 104a+b, 104'+104''),]]; and [[characterized by]]

a multi-channel acoustic echo cancellation unit (108) [[being specially adapted]]
configured to perform a near-end speaker detection and double-talk detection algorithm based
on [[acoustic-phonetic]] the speech characteristics derived by said audio feature extraction
and [[analyzing]] analysis means (106b) and said visual feature extraction and [[analyzing]]
analysis means (104a+b, 104'+104'').

2. (Currently Amended) A noise reduction system according to claim 1,
[[characterized by]] further comprising:

means (SW) for switching off an audio channel [[in case the actual level of]] if said
speech activity indication signal (s_i(nT)) falls below a predefined threshold value.

3. (Currently Amended) A noise reduction system according to [[anyone of the
claims 1 or 2, characterized in that]] claim 1, wherein said audio feature extraction and
[[analyzing]] analysis means (106b) [[is]] comprises an amplitude detector.
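
Illustrative sketch, not part of the claims: one way to read claims 1 to 3 together is
that an amplitude-based audio estimate is combined with a visual estimate, and the audio
channel is muted when the combined value falls below a threshold. The function names, the
equal weighting, the RMS scaling bounds and the threshold value are all assumptions made
for this example; the claims do not specify them.

```python
import math


def amplitude_activity(frame, floor=0.01, ceil=0.5):
    """Map the RMS amplitude of one audio frame to a 0..1 activity estimate.

    floor and ceil are hypothetical scaling bounds, not values from the claims.
    """
    rms = math.sqrt(sum(v * v for v in frame) / len(frame))
    return min(max((rms - floor) / (ceil - floor), 0.0), 1.0)


def gate_channel(frame, visual_activity, threshold=0.3, audio_weight=0.5):
    """Combine audio and visual estimates and mute the frame below threshold.

    visual_activity is assumed to be a 0..1 value supplied by some visual
    feature analysis (for example lip-movement tracking).
    """
    activity = (audio_weight * amplitude_activity(frame)
                + (1.0 - audio_weight) * visual_activity)
    if activity < threshold:
        return [0.0] * len(frame), activity  # audio channel switched off
    return list(frame), activity
```

A weighted sum is only one possible combination rule; a product or a learned classifier
would fit the claim wording equally well.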

4. (Currently Amended) A near-end speaker detection method for reducing the
noise [[level of]] in a detected analog audio sequence (s(t)), said method [[being
characterized by the following steps]] comprising:

[[-]] [[subjecting (S1)]] converting said analog audio sequence (s(t)) [[to an analog to]]
into a digital [[conversion,]] audio sequence;

[[-]] calculating (S2) [[the]] a corresponding discrete signal spectrum (S(k·Δf)) of the
analog to digital converted audio sequence (s(nT)) by performing a Fast Fourier Transform
(FFT)[[,]];

[[-]] detecting (S3) [[the]] a voice of [[said]] a speaker from said discrete signal spectrum
(S(k·Δf)) by analyzing visual features (o_v,nT) extracted from a [[simultaneously with the
recording of the analog audio sequence (s(t)) recorded]] video sequence (v(nT)) [[tracking the]]
associated with the audio sequence and including current [[location]] locations of the
speaker's face, lip movements and/or facial expressions of the speaker (S_i) in [[subsequent]]
a sequence of images in the video sequence[[,]];

[[-]] estimating (S4) [[the]] a noise power density spectrum (Φ_nn(f)) of [[the]]
statistically distributed background noise (n'(t)) based on [[the]] [[result of the speaker
detection step (S3),]] detection of the voice of the speaker;

[[-]] subtracting (S5) a discretized version (Φ_nn(k·Δf)) of the estimated noise power
density spectrum (Φ_nn(f)) from the discrete signal spectrum (S(k·Δf)) of the [[analog to
digital converted]] audio sequence (s(nT)), to obtain a difference signal; and

[[-]] calculating (S6) [[the]] a corresponding discrete time-domain signal (s_i(nT)) of the
obtained difference signal by performing an Inverse Fast Fourier Transform (IFFT)[[, thereby
yielding a discrete version of the]] to provide a recognized speech signal.
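
Illustrative sketch, not part of the claims: the transform, noise estimation, subtraction
and inverse transform steps of claim 4 resemble classical spectral subtraction, which can
be outlined as below. Several details are assumptions made for this example only: a naive
DFT stands in for the FFT so that the code needs no external library, the subtraction is
done on power values with a floor at zero, the phase of the noisy frame is reused, and the
speech_active flags stand in for the result of the audio-visual voice detection step.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns complex time-domain samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def estimate_noise_psd(frames, speech_active):
    """Average |S(k)|^2 over the frames flagged as containing no speech."""
    noise_frames = [f for f, active in zip(frames, speech_active) if not active]
    psd = [0.0] * len(frames[0])
    if not noise_frames:
        return psd
    for frame in noise_frames:
        for k, c in enumerate(dft(frame)):
            psd[k] += abs(c) ** 2
    return [p / len(noise_frames) for p in psd]


def spectral_subtract(frame, noise_psd):
    """Subtract the noise power per bin and return the time-domain result."""
    cleaned = []
    for c, noise_power in zip(dft(frame), noise_psd):
        power = max(abs(c) ** 2 - noise_power, 0.0)  # floor at zero (assumption)
        cleaned.append(cmath.rect(math.sqrt(power), cmath.phase(c)))
    return [v.real for v in idft(cleaned)]
```

In this sketch a frame that consists only of the estimated noise is reduced to silence,
while a frame processed with an all-zero noise estimate is returned unchanged.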

5. (Currently Amended) A near-end speaker detection method according to
claim 4, [[characterized by the step of]] further comprising:

[[conducting (S7)]] performing a multi-channel acoustic echo cancellation algorithm
which models echo path impulse responses by means of adaptive finite impulse response
(FIR) filters and subtracts echo signals from the analog audio sequence (s(t)) based on
acoustic-phonetic speech characteristics derived by an algorithm for extracting the visual
features from [[a]] the video sequence (v(nT)) [[tracking the location]] associated with the
audio sequence and including the locations of [[a]] the [[speaker's]] face, lip movements and/or
facial expressions of the speaker (S_i) in [[subsequent]] a sequence of images in the video
sequence.

6. (Currently Amended) A near-end speaker detection method according to
claim 5, [[characterized in that]] wherein said multi-channel acoustic echo cancellation
algorithm performs a double-talk detection procedure.
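
Illustrative sketch, not part of the claims: an adaptive FIR filter that models an echo
path and subtracts the estimated echo, with adaptation frozen while a near-end speaker is
active, might look as follows. This is a single-channel simplification of the
multi-channel arrangement in claims 5 and 6. The normalized LMS update rule, the tap
count, the step size and the near_end_active flags (standing in for the double-talk
decision derived from audio and visual features) are assumptions made for this example.

```python
def nlms_echo_cancel(far_end, mic, near_end_active, taps=4, mu=0.5, eps=1e-8):
    """Cancel far-end echo from the microphone signal with an adaptive FIR filter.

    far_end:         samples sent to the loudspeaker
    mic:             microphone samples (echo plus any near-end speech)
    near_end_active: per-sample flags; True freezes adaptation (double-talk)
    Returns the echo-reduced samples and the final filter weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    output = []
    for x, d, active in zip(far_end, mic, near_end_active):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # echo-reduced sample
        if not active:
            # Normalized LMS step; skipped while the near-end speaker talks
            norm = sum(h * h for h in history) + eps
            weights = [w + mu * error * h / norm
                       for w, h in zip(weights, history)]
        output.append(error)
    return output, weights
```

Freezing the update during double-talk keeps near-end speech from being mistaken for
echo and pulling the filter away from the true echo path.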

7. (Currently Amended) A near-end speaker detection method according to
[[anyone of the claims 4 to 6]] claim 4, [[characterized in that]] wherein said acoustic-phonetic
speech characteristics are based on [[the]] detecting opening of a [[speaker's]] mouth of the
speaker as an estimate of [[the]] acoustic energy of articulated vowels [[or]] and/or
diphthongs, [[respectively,]] detecting rapid movement of the [[speaker's]] lips of the speaker
as a hint to labial or labio-dental consonants, [[respectively, and]] and/or detecting other
statistically



In re: Mono Taneda 

International Appl. No. PCT/EP2004/000104 
International Filing Date: January 9, 2004 
Page 7 

[[detected]] phonetic characteristics [[of an association between]] associated with position and
movement of the lips [[and the]] and/or voice [[and]] and/or pronunciation of said speaker.

8. (Currently Amended) A near-end speaker detection method according to
[[anyone of the claims 4 to 7, characterized by]] claim 4, wherein detecting the voice of said
speaker comprises:

[[a learning procedure used for enhancing the step of]] detecting (S3) the voice of said
speaker (S_i) from the discrete signal spectrum (S(k·Δf)) of the analog to digital [[converted]]
version (s(nT)) of an analog audio sequence (s(t)) using a learning procedure by analyzing the
visual features (o_v,nT) extracted from [[a simultaneously with the recording of the analog
audio sequence (s(t)) recorded]] the video sequence (v(nT)) [[tracking the]] associated with the
audio sequence and including the current [[location]] locations of the [[speaker's]] face, lip
movements and/or facial expressions of the speaker (S_i) in [[subsequent]] a sequence of images
in the video sequence.

9. (Currently Amended) A near-end speaker detection method according to
[[anyone of the claims 1 to 8, characterized by the step of]] claim 4, further comprising:

correlating (S8a) the discrete signal spectrum (S_τ(k·Δf)) of a delayed version (s(nT-τ))
of the analog to digital [[converted]] audio signal (s(nT)) with an audio speech activity
estimate obtained by [[an]] amplitude detection (S8b) of [[the]] a band-pass-filtered discrete
signal spectrum (S(k·Δf)), [[thereby yielding]] to provide an estimate (S̃_i(f)) for [[the]] a
frequency spectrum (S_i(f)) corresponding to [[the]] a signal (s_i(t)) which represents
[[said speaker's]] a voice of said speaker as well as an estimate (Φ̃_nn(f)) for the noise
power density spectrum (Φ_nn(f)) of the statistically distributed background noise (n'(t)).

10. (Currently Amended) A near-end speaker detection method according to
claim 9, [[characterized by the step of]] further comprising:

correlating (S9) the discrete signal spectrum (S_τ(k·Δf)) of [[a]] the delayed version
(s(nT-τ)) of the analog to digital converted audio signal (s(nT)) with a visual speech activity
estimate taken from a visual feature vector supplied by the visual feature extraction and
analyzing means (104a+b, 104'+104''), [[thereby yielding]] to provide a further estimate
(S̃_i'(f)) for updating the estimate (S̃_i(f)) for the frequency spectrum (S_i(f)) corresponding
to the signal (s_i(t)) which represents said speaker's voice as well as a further estimate
(Φ̃_nn'(f)) for updating the estimate (Φ̃_nn(f)) for the noise power density spectrum
(Φ_nn(f)) of the statistically distributed background noise (n'(t)).

11. (Currently Amended) A near-end speaker detection method according [[anyone
of the claims 9 or 10, characterized by the step of]] to claim 9, further comprising:

adjusting (S10) the cut-off frequencies of a band-pass filter (201) used for filtering the
discrete signal spectrum (S(k·Δf)) of the analog to digital [[converted]] audio [[signal (s(t))
dependent]] sequence based on [[the]] a bandwidth of the estimated speech signal frequency
spectrum (S̃_i(f)).

12. (Currently Amended) A near-end speaker detection method according to
[[anyone of the claims 1 to 8, characterized by the steps of]] claim 4, further comprising:

[[-]] adding (S11a) an audio speech activity estimate obtained by [[an]] amplitude
detection of [[the]] a band-pass-filtered discrete signal spectrum (S(k·Δf)) of the analog to
digital [[converted]] audio [[signal (s(t))]] sequence to a visual speech activity estimate taken
from a visual feature vector [[(o_v,t)]] supplied by said visual feature extraction and analyzing
means (104a+b, 104'+104'')[[, thereby yielding]] to provide an audio-visual speech activity
estimate,

[[-]] correlating (S11b) the discrete signal spectrum (S(k·Δf)) with the audio-visual
speech activity estimate[[, thereby yielding]] to provide an estimate (S̃_i(f)) for [[the]] a
frequency spectrum (S_i(f)) corresponding to [[the]] a signal (s_i(t)) which represents
[[said speaker's]] a voice of said speaker as well as an estimate (Φ̃_nn(f)) for the noise power
density spectrum (Φ_nn(f)) of the statistically distributed background noise; and

[[-]] adjusting (S11c) [[the]] cut-off frequencies of a band-pass filter (201) used for
filtering the discrete signal spectrum (S(k·Δf)) of the analog to digital [[converted]] audio
signal dependent sequence based on [[the]] a bandwidth of the estimated [[speech]] signal
frequency spectrum (S̃_i(f)).

13. (Currently Amended) A telecommunication system, comprising:
[[Use of a noise reduction system (200b/c) according to anyone of the claims 1 to 3 and
a near-end speaker detection method according to anyone of the claims 5 to 13 for]]
a video-enabled phone;

a video-telephony based application in a telecommunication system running on [[a]]
the video-enabled phone [[with]]; and

a [[built in]] video camera (101b') built-in to the video-enabled phone and pointing at
[[the]] a face of a speaker (S_i) participating in a video telephony session,
wherein said video-telephony based application comprises: 

audio sequence detection means for detecting an analog audio sequence; 

audio feature extraction and analysis means for analyzing said analog audio 
sequence and extracting said audio features therefrom; 

video sequence detection means for detecting said video sequence; 

visual feature extraction and analysis means for analyzing the detected video 
sequence and extracting said visual features therefrom; 

noise reduction means for separating a speaker's voice from said background 
noise based on a combination of derived speech characteristics and outputting a 
speech activity indication signal comprising a combination of speech activity 
estimates supplied by said audio feature extraction and analysis means and said visual 
feature extraction and analysis means; and 

multi-channel acoustic echo cancellation means for performing a near-end
speaker detection and double-talk detection algorithm based on the speech
characteristics derived by said audio feature extraction and analysis means and said
visual feature extraction and analysis means.

14. (Currently Amended) A telecommunication device equipped with an audio-visual
user interface[[, characterized by]] and including the noise reduction system (200b/c)
according to [[anyone of the claims 1 to 3]] claim 1.

15. (New) A telecommunication system configured to perform the near-end 
speaker detection method of claim 4. 



